Summary. A graph is a structure composed of a set of vertices (i.e. nodes, dots)
connected to one another by a set of edges (i.e. links, lines). The concept of a graph
has been around since the late 19th century; however, only in recent decades has there
been a strong resurgence in both theoretical and applied graph research in
mathematics, physics, and computer science. In applied computing, since the late 1960s,
the interlinked table structure of the relational database has been the predominant
information storage and retrieval model. With the growth of graph/network-based
data and the need to efficiently process such data, new data management systems
have been developed. In contrast to the index-intensive, set-theoretic operations of
relational databases, graph databases make use of index-free, local traversals. This
article discusses the graph traversal pattern and its use in computing.
1 Introduction



The first paragraph of any publication on graphs usually contains the iconic
$G = (V, E)$ definition of a graph. This definition states that a graph is
composed of a set of vertices $V$ and a set of edges $E$. Normally following this
definition is the definition of the set $E$. For directed graphs, $E \subseteq (V \times V)$
and for undirected graphs, $E \subseteq \{V \times V\}$. That is, $E$ is a subset of all ordered
or unordered permutations of $V$ element pairings. From a purely theoretical
standpoint, such definitions are usually sufficient for deriving theorems.
However, in applied research, where the graph is required to be embedded in
reality, this definition says little about a graph's realization. The structure a
graph takes in the real world determines the efficiency of the operations that
are applied to it. It is exactly those efficient graph operations that yield an
unconventional problem-solving style. This style of interaction is dubbed the
graph traversal pattern and forms the primary point of discussion for this
article.^



^ The term pattern refers to data modeling/processing patterns found in computing 
such as the relational pattern, the map-reduce pattern, etc. In this sense, a pattern 






Marko A. Rodriguez^ and Peter Neubauer^ 



2 The Realization of Graphs 

Relational databases have been around since the late 1960s [2] and are
today's most predominant data management tool. Relational databases
maintain a collection of tables. Each table can be defined by a set of rows and a
set of columns. Semantically, rows denote objects and columns denote
properties/attributes. Thus, the datum at a particular row/column entry is the
value of the column property for that row object. Usually, a problem domain
is modeled over multiple tables in order to avoid data duplication. This
process is known as data normalization. In order to unify data in disparate tables,
a "join" is used. A join combines two tables when columns of one table refer to
columns of another table. Maintaining these references in a consistent state is
known as referential integrity. This is the classic relational database design,
which affords such databases their flexibility [11].

In stark contrast, graph databases do not store data in disparate tables.
Instead, there is a single data structure: the graph. Moreover, there is no
concept of a "join" operation, as every vertex and edge has a direct
reference to its adjacent vertex or edge. The data structure is already "joined" by
the edges that are defined. There are benefits and drawbacks to this model.
First, the primary drawback is that it is difficult to shard a graph (a difficulty
also encountered with relational databases that maintain referential integrity).
Sharding is the process of partitioning data across multiple machines in
order to scale a system horizontally.^ In a graph, with unconstrained, direct
references between vertices and edges, there usually does not exist a clean
data partition. Thus, it becomes difficult to scale graph databases beyond the
confines of a single machine and, at the same time, maintain the speed of a
traversal across sharded borders. However, at the expense of this drawback
there is a significant advantage: there is a constant time cost for retrieving an
adjacent vertex or edge. That is, regardless of the size of the graph as a whole,
the cost of a local read operation at a vertex or edge remains constant. This
benefit is so important that it creates the primary means by which users
interact with graph databases: traversals. Graphs offer a unique vantage point on
data, where the solution to a problem is seen as abstractly defined traversals
through its vertices and edges.^

is a way of approaching a data-centric problem that usually has benefits in terms
of efficiency and/or expressibility.

^ Sharding is easily solved by other database architectures such as key/value stores
[3] and document databases [8]. In such systems, there is no explicit linking
between data in different "collections" (i.e. documents, key/value pairs). Strict
partitions of data make it easier to horizontally scale a database [18].

^ The space of graph databases is relatively new. While it is possible to model and 
process a graph in most any type of database (e.g. relational databases, key/value 
stores, document databases), a graph database, in the context of this article, is 
one that makes use of direct references between adjacent vertices and edges. As 
2.1 The Indices of Relational Tables

Imagine that there is a gremlin who is holding a number between 1 and 100 in
memory. Moreover, assume that when guessing the number, the gremlin will
only reply by saying whether the guessed number is greater than, less than, or
equal to the number in memory. What is the best strategy for determining the
number in the fewest guesses? On average, the quickest way to determine the
number is to partition the space of guesses into equal-size chunks. For example,
ask if the number is 50. If the gremlin states that it is less than 50, then ask,
is the number 25? If greater than 25, then ask, is the number 37? Follow
this partition scheme until the number is converged upon. The structure that
these guesses form over the sequence from 1 to 100 is a binary search tree. On
average, this tree structure is more efficient in time than guessing each number
starting from 1 and going to 100. This is ultimately the difference between
an index-based search and a linear search. If there were no indices for a set,
every element of the set would have to be examined to determine if it has a
particular property of interest.^ For $n$ elements, a linear scan of this nature
runs in $O(n)$. When elements are indexed, there exist two structures: the
original set of elements and an index of those elements. Typical indices have
the convenient property that searching them takes $O(\log_2 n)$. For massive sets,
the space that indices take is well worth their weight in time.
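
The guessing game can be sketched in Python to make the difference between the two strategies concrete. This is only an illustration under stated assumptions: the function names `linear_guesses` and `binary_guesses` are hypothetical, and the halving rule uses integer midpoints, so after the first guess of 50 the exact sequence of guesses may differ slightly from the 50, 25, 37 sequence described above.

```python
def linear_guesses(secret, low=1, high=100):
    """Count guesses when trying low, low + 1, ... until the secret is hit."""
    count = 0
    for guess in range(low, high + 1):
        count += 1
        if guess == secret:
            return count
    raise ValueError("secret outside range")


def binary_guesses(secret, low=1, high=100):
    """Count guesses when halving the remaining interval after each reply."""
    count = 0
    while low <= high:
        guess = (low + high) // 2
        count += 1
        if guess == secret:
            return count
        if secret < guess:
            high = guess - 1   # gremlin says "less than"
        else:
            low = guess + 1    # gremlin says "greater than"
    raise ValueError("secret outside range")


# Worst case over every possible secret number.
worst_linear = max(linear_guesses(s) for s in range(1, 101))
worst_binary = max(binary_guesses(s) for s in range(1, 101))
print(worst_linear, worst_binary)
```

In the worst case the linear strategy needs as many guesses as there are numbers, while the halving strategy needs a number of guesses that grows only logarithmically with the size of the range.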

Relational databases take significant advantage of such indices. It is
through indices that rows with a column value are efficiently found.
Moreover, the index makes it possible to efficiently join tables together in order to
move between tables that are linked by particular columns. Assume a simple
example where there are two tables: a person table and a friend table. The
person table has the following two columns: unique identifier and name.
The friend table has the following two columns: person_a and person_b. The
semantics of the friend table is that person a is friends with person b.
Consider the problem of determining the names of all of Alberto Pepe's friends.
Figure 1 and the following list break down this simple query into all the
micro-operations that must occur to yield results.^

1. Query the person.name index to find the row in person with the name
"Alberto Pepe." [$O(\log_2 n)$]

2. Given the person row returned by the index, get the identifier for that
row. [$O(1)$]

such, graph databases are those systems that are optimized for graph traversals.
The Neo4j graph database is an example of such a database [7].
^ In a relational database, this process is known as a full table scan.
^ Assume that the number of rows in person is n and the number of rows in friend
is m. Moreover, for the sake of simplicity, assume that names, like identifiers, in
the person table are unique.
Fig. 1. A table representation of people and their friends.



3. Query the friend.person_a index to find all the rows in friend with the
identifier from the previous step. [$O(\log_2 x) : x \ll m$]^

4. Given each of the k rows returned, get the person_b identifier for those
rows. [$O(k)$]

5. For each of the k friend identifiers, query the person.identifier index for the
row with that friend identifier. [$O(k \log_2 n)$]

6. Given the k person rows, get the name value for those rows.
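
The six micro-operations can be traced in a small Python sketch. Everything here is an assumption made for illustration: the rows, the friend names, and the helper `lookup` are hypothetical, and sorted lists searched with the standard `bisect` module merely stand in for the tree indices a relational database would maintain.

```python
from bisect import bisect_left, bisect_right

# Hypothetical toy tables; the rows are illustrative, not taken from Figure 1.
person = [(1, "Alberto Pepe"), (2, "Marko"), (3, "Peter"), (4, "Josh")]
friend = [(1, 2), (1, 3), (2, 4)]  # (person_a, person_b)

# Sorted (key, row position) lists stand in for tree-based indices.
name_index = sorted((name, pos) for pos, (_, name) in enumerate(person))
id_index = sorted((pid, pos) for pos, (pid, _) in enumerate(person))
person_a_index = sorted((a, pos) for pos, (a, _) in enumerate(friend))


def lookup(index, key):
    """Return row positions for key via binary search on a sorted index."""
    lo = bisect_left(index, (key,))
    hi = bisect_right(index, (key, float("inf")))
    return [pos for _, pos in index[lo:hi]]


def friends_of(name):
    row = person[lookup(name_index, name)[0]]                        # step 1
    pid = row[0]                                                     # step 2
    friend_rows = [friend[p] for p in lookup(person_a_index, pid)]   # step 3
    friend_ids = [b for _, b in friend_rows]                         # step 4
    rows = [person[lookup(id_index, b)[0]] for b in friend_ids]      # step 5
    return [n for _, n in rows]                                      # step 6


print(friends_of("Alberto Pepe"))
```

Note that step 5 performs one index search per friend, which is where the $k \log_2 n$ term in the enumeration comes from.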

The final operation yields the names of Alberto's friends. This example
elucidates the classic join operation utilized in relational databases. By being
able to join the person and friend tables, it is possible to move from a name,
to the person, to his or her friends, and then, ultimately, to their names. In
effect, the join operation forms a graph that is dynamically constructed as
one table is linked to another table. While having the benefit of being able to
dynamically construct graphs, the limitation is that this graph is not explicit
in the relational structure, but instead must be inferred through a series of
index-intensive operations. Moreover, while only a particular subset of the
data in the database may be desired (e.g. only Alberto's friends), all data
in all queried tables must be examined in order to extract the desired subset
(e.g. all friends of all people). Even though an $O(\log_2 n)$ read time is fast for
a search, as the indices grow larger with the growth of the data and as
more join operations are used, this model becomes inefficient. At the limit, the
inferred graph that is constructed through joins is best solved (with respect
to time) by a graph database.



^ Given that an individual will have many friends, the number of index nodes in
the friend.person_a index will be much less than m.
2.2 The Graph as an Index

Most of graph theory is concerned with the development of theorems for
single-relational graphs [1]. A single-relational graph maintains a set of edges, where
all the edges are homogeneous in meaning. For example, all edges denote
friendship or kinship, but not both together within the same structure. In
application, complex domain models are more conveniently represented by
multi-relational, property graphs.^ The edges in a property graph are typed
or labeled and thus, edges are heterogeneous in meaning. For example, a
property graph can model friendship, kinship, business, communication, etc.
relationships all within the same structure. Moreover, vertices and edges in a
property graph maintain a set of key/value pairs. These are known as
properties and allow for the representation of non-graphical data, e.g. the name of
a vertex, the weight of an edge, etc. Formally, a property graph can be defined
as $G = (V, E, \lambda, \mu)$, where edges are directed (i.e. $E \subseteq (V \times V)$), edges are
labeled (i.e. $\lambda : E \rightarrow \Sigma$), and properties are a map from elements and keys to
values (i.e. $\mu : (V \cup E) \times R \rightarrow S$).
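
One possible in-memory reading of this definition is sketched below in Python. The class and function names (`Vertex`, `Edge`, `add_edge`) and the sample data are hypothetical; the sketch only aims to show labeled, directed edges, key/value properties on both vertices and edges, and direct references between adjacent elements. It is not a description of how any particular graph database is implemented.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # identity-based equality avoids recursing through cycles
class Vertex:
    id: int
    properties: dict = field(default_factory=dict)  # key/value pairs of this vertex
    out_edges: list = field(default_factory=list)   # direct references to edges
    in_edges: list = field(default_factory=list)


@dataclass(eq=False)
class Edge:
    tail: Vertex   # outgoing vertex
    head: Vertex   # incoming vertex
    label: str     # the edge label
    properties: dict = field(default_factory=dict)


def add_edge(tail, label, head, **properties):
    """Create a directed, labeled edge and wire it into both endpoints."""
    edge = Edge(tail, head, label, properties)
    tail.out_edges.append(edge)
    head.in_edges.append(edge)
    return edge


# Hypothetical sample data.
alberto = Vertex(1, {"name": "Alberto Pepe"})
marko = Vertex(2, {"name": "Marko"})
e = add_edge(alberto, "friend", marko, weight=1.0)
```

Because each vertex holds references to its own edges, moving from `alberto` to `e` to `marko` requires no lookup in any global structure.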

In the property graph model, it is common for the properties of the
vertices (and sometimes edges) to be indexed using a tree structure analogous,
in many ways, to those used by relational databases. This index can be
represented by some external indexing system or endogenous to the graph as
an embedded tree (see §3.2).^ Given the prior situation, once a set of
elements has been identified by the index search, a traversal is executed
through the graph. Elements in a graph are adjacent to one another by
direct references. A vertex is adjacent to its incoming and outgoing edges and
an edge is adjacent to its outgoing (i.e. tail) and incoming (i.e. head)
vertices. The domain model defines how the elements of the problem space are
related. Similar to the gremlin stating that 50 is greater than the number to
be guessed, an edge connecting vertex i and j and labeled friend states that
vertex i is friend-related to vertex j. Indices create "short cuts" in the graph
as they partition elements according to specialized, compute-centric semantics
(e.g. numbers being less than or greater than another). Likewise, a domain

^ In the parlance of graphs, a property graph is a directed, edge-labeled, attributed
multi-graph. For the sake of simplicity, such structures will simply be called
property graphs. These types of graph structures are used extensively in computing
as they are more expressive than the simplified mathematical objects studied in
theory. However, note that expressiveness is defined by ease of use, not by the
limits of what can be modeled [15].

^ The reason for using an external indexing system is that it may be optimized for
certain types of lookups such as full-text search.

^ This is ultimately what is accomplished in a relational database when a row of
a table is located and a value in a column of that row is fetched (e.g. see the
second micro-operation of the previous relational database enumeration).
However, when that row doesn't have all the requisite data (usually due to database
normalization), it requires joining with another table to locate that data. It
is this situation which is costly in a relational database.
model partitions elements using semantics defined by the domain modeler.
Thus, in many ways, a graph can be seen as an indexing structure.

In the previous relational example, a person in the person table has two
properties: a unique identifier and a name. The analogue in a property
graph would be to have the identifier and name values represented as vertex
properties. Moreover, the friend table would not exist as a table, but as direct
friend-labeled edges between vertices. This idea is diagrammed in Figure 2.
The micro-operations used to find the names of all of Alberto Pepe's friends
are provided in the following enumeration.



Fig. 2. A graph representation of people and their friends. Given the tree-nature
of the vertex.name index, it is possible, and many times useful, to model the index
endogenous to the graph (see §3.2).



1. Query the vertex.name index to find all the vertices in G with the name
"Alberto Pepe." [$O(\log_2 n)$]

2. Given the vertex returned, get the k friend edges emanating from this
vertex. [$O(k + x)$]^

3. Given the k friend edges retrieved, get the k vertices on the heads of
those edges. [$O(k)$]

4. Given these k vertices, get the k name properties of these vertices.
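
The four graph micro-operations can be sketched in Python as follows. The data and names are hypothetical, and a plain dictionary stands in for the vertex.name index (a tree index would carry the logarithmic cost given in step 1). Each vertex keeps direct references to its outgoing edges, and each edge keeps a direct reference to its head vertex.

```python
# Hypothetical toy graph; names and edges are illustrative, not those of Figure 2.
vertices = {
    1: {"name": "Alberto Pepe", "out": []},
    2: {"name": "Marko", "out": []},
    3: {"name": "Peter", "out": []},
}


def add_edge(tail, label, head):
    """Store the edge on its tail vertex with a direct reference to the head."""
    vertices[tail]["out"].append({"label": label, "head": vertices[head]})


add_edge(1, "friend", 2)
add_edge(1, "colleague", 3)   # a non-friend edge that step 2 must skip
add_edge(1, "friend", 3)

# Stand-in for the vertex.name index.
name_index = {v["name"]: v for v in vertices.values()}


def friend_names(name):
    start = name_index[name]                                      # step 1
    edges = [e for e in start["out"] if e["label"] == "friend"]   # step 2: scans k + x edges
    heads = [e["head"] for e in edges]                            # step 3
    return [v["name"] for v in heads]                             # step 4


print(friend_names("Alberto Pepe"))
```

After the initial index lookup, no further index is consulted: steps 2 through 4 only follow references held by the elements themselves.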

^ If a graph database does not index the edges of a vertex by their labels, then a
linear scan of all edges emanating from a vertex must occur to locate the set of
friend-labeled edges. Thus, $k + x$ is the total number of edges emanating from
the current vertex.

^ If a graph database does not index the properties of a vertex, then a linear scan
of all the properties must occur. If y is the total number of properties on the
vertices (assuming a homogeneous count for all vertices), then, in the worst-case
scenario, ky elements must be examined.
The final operation yields the names of Alberto's friends. In a graph database,
there is no explicit join operation because vertices maintain direct references
to their adjacent edges. In many ways, the edges of the graph serve as explicit,
"hard-wired" join structures (i.e. structures that are not computed at query
time as in a relational database). The act of traversing over an edge is the
act of joining. However, what makes this more efficient in a graph database
is that traversing from one vertex to another is a constant time operation.
Thus, traversal time is defined solely by the number of elements touched by
the traversal. This is irrespective of the size/topology of the graph as a whole.
The time it takes to make a single step in a traversal is determined by the local
topology of the subgraph surrounding the particular vertex being traversed
from.^

The real power of graph databases makes itself apparent when traversing
multiple steps in order to unite disparate vertices by a path (i.e. vertices not
directly connected). First, there are no $O(\log_2 n)$ operations. Second, the type
of path taken defines the "higher order," inferred relationship that exists
between two vertices.^ Traversals based on abstractly defined paths are the
core of the graph traversal pattern. The next section discusses the graph
traversal pattern and its application to common problem-solving situations.



3 Graph Traversals 

A traversal refers to visiting elements (i.e. vertices and edges) in a graph
in some algorithmic fashion. This section will present a functional,
flow-based approach [13] to traversing property graphs and how different types
of traversals over different types of graph datasets support different types of
problem-solving.

The most primitive, read-based operation on a graph is a single-step
traversal from element i to element j, where $i, j \in (V \cup E)$.^ For example, a single-step
operation can answer questions such as "which edges are outgoing from
this vertex?", "which vertex is at the head of this edge?", etc. Single-step
operations expose explicit adjacencies in the graph (i.e. adjacencies that are

^ The consequence of this is that traversing through a "super node" (i.e. a
high-degree vertex) in a graph is slower than traversing through a small-degree vertex.
^ In many ways, this is the graph equivalent of the join operation used by relational
databases, though no global indices are used. When traversing a multi-step path,
the source and sink vertices are united by a semantic determined by the path
taken. For example, going from a person, to their friends, and then to their friends'
friends, will unite that person to people two steps away in the graph. This popular
path is known as FOAF (friend of a friend).

^ In general, the term "algorithm" is used in a looser sense than the classic definition
in that it allows for randomization and sampling when traversing.
^ While it is possible to write and delete elements from a graph, such operations
will not be discussed.
"hard-wired"). The following list itemizes the various types of single-step
traversals. Note that these operations are defined over power multiset
domains and ranges. The reason for this is that it naturally allows for function
composition, where a composition is a formal description of a traversal.

• $e_{out} : \hat{\mathcal{P}}(V) \rightarrow \hat{\mathcal{P}}(E)$: traverse to the outgoing edges of the vertices.

• $e_{in} : \hat{\mathcal{P}}(V) \rightarrow \hat{\mathcal{P}}(E)$: traverse to the incoming edges of the vertices.

• $v_{out} : \hat{\mathcal{P}}(E) \rightarrow \hat{\mathcal{P}}(V)$: traverse to the outgoing (i.e. tail) vertices of the
edges.

• $v_{in} : \hat{\mathcal{P}}(E) \rightarrow \hat{\mathcal{P}}(V)$: traverse to the incoming (i.e. head) vertices of the edges.

• $\epsilon : \hat{\mathcal{P}}(V \cup E) \times R \rightarrow \hat{\mathcal{P}}(S)$: get the element property values for key $r \in R$.

When edges are labeled and elements have properties, it is desirable to
constrain the traversal to edges of a particular label or elements with particular
properties. These operations are known as filters and are abstractly defined
in the following itemization.^

• $e_{lab\pm} : \hat{\mathcal{P}}(E) \times \Sigma \rightarrow \hat{\mathcal{P}}(E)$: allow (or filter) all edges with the label $\sigma \in \Sigma$.

• $\epsilon_{p\pm} : \hat{\mathcal{P}}(V \cup E) \times R \times S \rightarrow \hat{\mathcal{P}}(V \cup E)$: allow (or filter) all elements with
the property $s \in S$ for key $r \in R$.

• $\epsilon_{\epsilon\pm} : \hat{\mathcal{P}}(V \cup E) \times (V \cup E) \rightarrow \hat{\mathcal{P}}(V \cup E)$: allow (or filter) all elements that
are the provided element.
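
The single-step operations and filters itemized above might be sketched in Python as follows. This is an assumption-laden illustration: the function names (`e_out`, `e_lab`, `eps_p`, and so on), the `allow` flag standing in for the + and - variants, and the toy data are all hypothetical. Lists play the role of multisets since they keep repeated elements. The edge list is scanned linearly for brevity, which a graph database with direct references would not need to do.

```python
from collections import namedtuple

Edge = namedtuple("Edge", "tail label head")

# Hypothetical toy data: vertex ids, names, and edges are assumptions.
PROPS = {1: {"name": "Alberto Pepe"}, 2: {"name": "Marko"}, 3: {"name": "Peter"}}
EDGES = [Edge(1, "friend", 2), Edge(1, "friend", 3), Edge(2, "knows", 3)]


def e_out(vs):
    """Outgoing edges of every vertex in the multiset vs."""
    return [e for v in vs for e in EDGES if e.tail == v]


def e_in(vs):
    """Incoming edges of every vertex in the multiset vs."""
    return [e for v in vs for e in EDGES if e.head == v]


def v_out(es):
    """Outgoing (tail) vertices of the edges."""
    return [e.tail for e in es]


def v_in(es):
    """Incoming (head) vertices of the edges."""
    return [e.head for e in es]


def eps(xs, key):
    """Property values for key; elements without the key contribute nothing."""
    return [PROPS[x][key] for x in xs if key in PROPS.get(x, {})]


def e_lab(es, label, allow=True):
    """allow=True keeps edges with the label (+); allow=False drops them (-)."""
    return [e for e in es if (e.label == label) == allow]


def eps_p(xs, key, value, allow=True):
    """Keep (or drop) elements whose property key equals value."""
    return [x for x in xs if (PROPS.get(x, {}).get(key) == value) == allow]


def eps_e(xs, element, allow=True):
    """Keep (or drop) elements that are the provided element."""
    return [x for x in xs if (x == element) == allow]


print(v_out(e_in([3])))
print(e_lab(e_out([1, 2]), "friend", allow=False))
```

Since every function maps a list to a list, the output of one step can be passed directly to the next.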

Through function composition, we can define graph traversals of arbitrary
length. A simple example is traversing to the names of Alberto Pepe's friends.
If i is the vertex representing Alberto Pepe and

$f : \hat{\mathcal{P}}(V) \rightarrow \hat{\mathcal{P}}(S)$,

where

$f(i) = \epsilon(v_{in}(e_{lab+}(e_{out}(i), \text{friend})), \text{name})$,

then $f(i)$ will return the names of Alberto Pepe's friends. Through function
currying and composition, the previous definition can be represented more
clearly with the following function rule,

$f(i) = (\epsilon^{name} \circ v_{in} \circ e_{lab+}^{friend} \circ e_{out})(i)$.

The function $f$ says: traverse to the outgoing edges of vertex i, then only allow

^ The power set of set A is denoted $\mathcal{P}(A)$ and is the set of all subsets of A (i.e. $2^A$).
The power multiset of A, denoted $\hat{\mathcal{P}}(A)$, is the infinite set of all subsets of multisets
of A. This set is infinite because multisets allow for repeated elements [12].
^ The path algebra defined in [16] operates over multi-relational graphs represented
as a tensor. Besides the inclusion of vertex/edge properties used in this article,
the tensor-based path algebra has the same expressivity as the functional model
presented in this section.

^ Filters can be defined as allowing or disallowing certain elements. For allowing,
the symbol $+$ is used. For disallowing, the symbol $-$ is used.
Fig. 3. A single path along the $f$ traversal.



those edges with the label friend, then traverse to the incoming (i.e. head)
vertices on those friend-labeled edges. Finally, of those vertices, return their
name property.^ A single legal path according to this function is diagrammed
in Figure 3. Though not diagrammed for the sake of clarity, the traversal would
also go from vertex 1 to the name of vertex 2 and vertex 3. The function $f$
is a "higher-order" adjacency defined as the composition of explicit
adjacencies and serves as a join of Alberto and his friends' names. The remainder
of this section demonstrates graph traversals in real-world problem-solving
situations.
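
The curried, composed form of $f$ can be mimicked in Python with `functools.partial` and a small right-to-left `compose` helper. This is a sketch under stated assumptions: the helper names, the argument order chosen to make currying convenient, and the toy data are all hypothetical.

```python
from functools import partial, reduce

# Hypothetical toy data as (tail, label, head) triples; not taken from the figures.
EDGES = [(1, "friend", 2), (1, "friend", 3), (1, "likes", 4)]
PROPS = {1: {"name": "Alberto Pepe"}, 2: {"name": "Marko"},
         3: {"name": "Peter"}, 4: {"name": "a resource"}}


def e_out(vs):
    """Outgoing edges of the vertices."""
    return [e for v in vs for e in EDGES if e[0] == v]


def v_in(es):
    """Incoming (head) vertices of the edges."""
    return [head for _, _, head in es]


def e_lab_allow(label, es):
    """Only allow edges carrying the given label."""
    return [e for e in es if e[1] == label]


def eps(key, xs):
    """Property values of the elements for the given key."""
    return [PROPS[x][key] for x in xs]


def compose(*functions):
    """Right-to-left composition: compose(f, g)(x) == f(g(x))."""
    return reduce(lambda f, g: lambda x: f(g(x)), functions)


# Currying: fix the key and the label first, leaving functions of one multiset.
f = compose(partial(eps, "name"), v_in, partial(e_lab_allow, "friend"), e_out)
print(f([1]))  # ['Marko', 'Peter']
```

Reading the arguments of `compose` from right to left reproduces the order of the steps in the prose description of $f$.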



3.1 Traversing for Recommendation 

Recommendation systems are designed to help people deal with the problem
of information overload by filtering information in the system that doesn't
pertain to the person [14]. In a positive sense, recommendation systems focus
a person's attention on those resources that are likely to be most relevant
to their particular situation. There is a standard dichotomy in
recommendation research: that of content- vs. collaborative filtering-based
recommendation. The former deals with recommending resources that share characteristics
(i.e. content) with a set of resources. The latter is concerned with determining
the similarity of resources based upon the similarity of the taste of the
people modeled within the system [6]. These two seemingly different approaches
to recommendation are conveniently solved using a graph database and two
simple traversal techniques [10, 5]. Figure 4 presents a toy graph data set,
where there exists a set of people, resources, and features related to each other
by likes- and feature-labeled edges. This simple data set is used for the
remaining examples of this subsection.

^ Note that the order of a composition is evaluated from right to left.
^ This is known as a virtual edge in the graph system called DEX [9].
Fig. 4. A graph data structure containing people (p), their liked resources (r), and
each resource's features (f).



Content-Based Recommendation 



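In order to identify resources that are similar in features (i.e. content-based
recommendation) to a resource, traverse to all resources that share the
same features. This is accomplished with the following function,
$f : \hat{\mathcal{P}}(V) \rightarrow \hat{\mathcal{P}}(V)$, where

$f(i) = (\epsilon_{\epsilon-}^{i} \circ v_{out} \circ e_{lab+}^{feature} \circ e_{in} \circ v_{in} \circ e_{lab+}^{feature} \circ e_{out})(i)$.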

Assuming i = 3, function $f$ states: traverse to the outgoing edges of resource
vertex 3, only allow feature-labeled edges, and then traverse to the incoming 
vertices of those feature-labeled edges. At this point, the traverser is at 
feature vertex 8. Next, traverse to the incoming edges of feature vertex 8, 
only allow feature-labeled edges, and then traverse to the outgoing vertices of 
these feature-labeled edges. At this point, the traverser is at resource vertices 
3 and 2. However, since we are trying to identify those resources similar in 
content to vertex 3, we need to filter out vertex 3. This is accomplished by 
the last stage of the function composition. Thus, given the toy graph data 
set, vertex 2 is similar to vertex 3 in content. This traversal is diagrammed in 
Figure 5. 
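
The walk just described can be reproduced on a small stand-in graph. The edge set below is hypothetical and is not the edge set of Figure 4; it is chosen only so that resources 2 and 3 share feature 8, as in the text. The helper names are likewise assumptions, and the linear scans over the edge list are for brevity only.

```python
# Hypothetical (tail, label, head) edges; only the 2/3/8 relationship mirrors the text.
EDGES = [
    (7, "likes", 3), (7, "likes", 1),
    (1, "feature", 9), (4, "feature", 9),
    (2, "feature", 8), (3, "feature", 8),
]


def e_out(vs):
    return [e for v in vs for e in EDGES if e[0] == v]


def e_in(vs):
    return [e for v in vs for e in EDGES if e[2] == v]


def v_out(es):
    return [e[0] for e in es]


def v_in(es):
    return [e[2] for e in es]


def allow(es, label):
    """Only allow edges carrying the given label."""
    return [e for e in es if e[1] == label]


def similar_in_content(i):
    """Resources sharing a feature with resource i, with i itself filtered out."""
    features = v_in(allow(e_out([i]), "feature"))        # resource -> its features
    resources = v_out(allow(e_in(features), "feature"))  # features -> resources
    return [r for r in resources if r != i]              # final filter stage


print(similar_in_content(3))  # [2]
```

A resource that shared several features with resource 3 would appear several times in the result, which is what makes the multiset useful for ranking.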

It is simple to extend content-based recommendation to problems such as:
"Given what person i likes, what other resources have similar features?" Such
a problem is solved using the function $f$ defined above combined
with a new composition that finds all the resources that person i likes. Thus,
if $g : \hat{\mathcal{P}}(V) \rightarrow \hat{\mathcal{P}}(V)$, where

$g(i) = (v_{in} \circ e_{lab+}^{likes} \circ e_{out})(i)$,
Fig. 5. A traversal that identifies resources that are similar in content to a set of
resources based upon shared features.



then to determine those resources similar in features to the resources that
person vertex 7 likes, compose functions $f$ and $g$: $(f \circ g)(7)$. Those resources
that share more features in common will be returned more often by $f \circ g$.^

What has been presented is an example of the use of traversals to do
naive content-based recommendation. It is possible to extend the functions
presented to normalize paths (e.g. a resource can have every feature and thus,
is related to everything), find novelty (e.g. feature paths that are rare and only
shared by a certain subset of resources), etc. In most cases, when creating a
graph traversal, a developer will compose different predefined paths into
longer compositions. Along with speed of execution, this is one of the benefits
of using a functional, flow-based model for graph traversals [19]. Moreover,
each component has a high-level meaning (e.g. the resources that a person
likes) and as such, the verbosity of longer compositions can be minimal (e.g. $f \circ g$).

Collaborative Filtering-Based Recommendation 

With collaborative filtering, the objective is to identify a set of resources that
have a high probability of being liked by a person based upon identifying other
people in the system that enjoy similar likes. For example, if person a and
person b share 90% of their liked resources in common, then the remaining 10%
they don't share in common are candidates for recommendation. Solving the
problem of collaborative filtering using graph traversals can be accomplished

^ Again, path traversal functions are defined over power multisets. In this way, it is
possible for a function to return repeated elements. In some situations,
deduplicating this set is desired. In other situations, repeated elements can be used to
weight/rank the results.
with the following traversal. For the sake of clarity, the traversal is broken into
two components: $f$ and $g$, where $f : \hat{\mathcal{P}}(V) \rightarrow \hat{\mathcal{P}}(V)$ and $g : \hat{\mathcal{P}}(V) \rightarrow \hat{\mathcal{P}}(V)$.

$f(i) = (\epsilon_{\epsilon-}^{i} \circ v_{out} \circ e_{lab+}^{likes} \circ e_{in} \circ v_{in} \circ e_{lab+}^{likes} \circ e_{out})(i)$.

Function $f$ traverses to all those people vertices that like the same resources
as person vertex i and who themselves are not vertex i (as a person is
obviously similar to themselves and thus, doesn't contribute anything to the
computation). The more liked resources a person shares in common with
i, the more traversers will be located at that person's vertex. In other words,
if person i and person j share 10 liked resources in common, then $f(i)$ will
return person j 10 times. Next, function $g$ is defined as

$g(j) = (v_{in} \circ e_{lab+}^{likes} \circ e_{out})(j)$.

Function $g$ traverses to all the resources liked by vertex j. In composition,
$(g \circ f)(i)$ determines all those resources that are liked by those people that
have similar tastes to vertex i. If person j likes 10 resources in common with
person i, then the resources that person j likes will be returned at least 10
times by $g \circ f$ (perhaps more if a path exists to those resources from another
person vertex as well). Figure 6 diagrams a function path starting from vertex
7. Only one legal path is presented for the sake of diagram clarity.
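
The composition $g \circ f$ can be tried on a small stand-in data set. The likes edges below are hypothetical and do not reproduce the figures; the helper names are assumptions as well. Here `g` accepts a whole multiset of people, and `collections.Counter` turns the repeated elements of the result into a ranking.

```python
from collections import Counter

# Hypothetical likes edges (person, "likes", resource); not the data of the figures.
EDGES = [
    (7, "likes", 1), (7, "likes", 2),
    (6, "likes", 1), (6, "likes", 2), (6, "likes", 3),
    (5, "likes", 2), (5, "likes", 4),
]


def e_out(vs):
    return [e for v in vs for e in EDGES if e[0] == v]


def e_in(vs):
    return [e for v in vs for e in EDGES if e[2] == v]


def v_out(es):
    return [e[0] for e in es]


def v_in(es):
    return [e[2] for e in es]


def allow(es, label):
    """Only allow edges carrying the given label."""
    return [e for e in es if e[1] == label]


def f(i):
    """People who like what person i likes, once per shared resource, minus i."""
    resources = v_in(allow(e_out([i]), "likes"))
    people = v_out(allow(e_in(resources), "likes"))
    return [p for p in people if p != i]


def g(js):
    """Resources liked by each person in the multiset js."""
    return v_in(allow(e_out(js), "likes"))


ranking = Counter(g(f(7)))
print(ranking.most_common())
```

In this toy data, person 6 shares two liked resources with person 7 and is therefore returned twice by `f(7)`, so every resource that person 6 likes is counted twice in the ranking. Resources that person 7 already likes also appear, as in the traversal described above.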
Fig. 6. A traversal that identifies resources that are similar in content to a resource
based upon shared features.



With the graph traversal pattern, there exists a single graph data struc- 
ture that can be traversed in different ways to expose different types of 
recommendations — generally, different types of relationships between vertices. 
Being able to mix and match the types of traversals executed alters the seman- 
tics of the final rankings and conveniently allows for hybrid recommendation 
algorithms to emerge. 
3.2 Traversing Endogenous Indices

A graph is a general-purpose data structure. A graph can be used to model 
lists, maps, trees, etc. As such, a graph can model an index. It was assumed, 
in §2.2, that a graph database makes use of an external indexing system to 
index the properties of its vertices and edges. The reason stated was that spe- 
cialized indexing systems are better suited for special-purpose queries such as 
those involving full-text search. However, in many cases, there is nothing that 
prevents the representation of an index within the graph itself — vertices and 
edges can be indexed by other vertices and edges. In fact, given the nature of 
how vertices and edges directly reference each other in a graph database, index 
look-up speeds are comparable. Endogenous indices afford graph databases a 
great flexibility in modeling a domain. Not only can objects and their rela- 
tionships be modeled (e.g. people and their friendships), but also the indices 
that partition the objects into meaningful subsets (e.g. people within a 2D 
region of space). The remainder of this subsection will discuss the represen- 
tation and traversal of a spatial, 2D-index that is explicitly modeled within a 
property graph. 

The domain of spatial analysis makes use of advanced indexing structures 
such as the quadtree [4, 17]. Quadtrees partition a two-dimensional plane into 
rectangular boxes based upon the spatial density of the points being indexed. 
Figure 7 diagrams how space is partitioned as the density of points increases 
within a region of the index.

Fig. 7. A quadtree partition of a plane. This figure is an adaptation of a public
domain image provided courtesy of David Eppstein.

^ One of the primary motivations behind this article is to stress the importance of
thinking of a graph as simply an index of itself, where the primary purpose is to
traverse the various defined indices in ways that elicit problem-solving within the
domain being modeled.

^ Those indices that have a graph-like structure are suited for representing as a
graph. It is noted that not all indices meet this criterion.
In order to demonstrate how a quadtree index can be represented and
traversed, a toy graph data set is presented. This data set is diagrammed in
Figure 8. The top half of Figure 8 represents a quadtree index (vertices 1-9). This
Fig. 8. A quadtree index of a space that contains points of interest. The index is
composed of the vertices 1-9 and the points of interest are the vertices a-i. While not
diagrammed for the sake of clarity, all edges are labeled sub (meaning subsumes) and
each point of interest vertex has an associated bottom-left (bl) property, top-right
(tr) property, and a type property which is equal to "poi."



quadtree index is partitioning "points of interest" (vertices a-i) located within 
the diagrammed plane. All vertices maintain three properties — bottom-left 

The plane depicted does not actually exist as a data structure, but is represented 
here to denote how the different vertices lying on that plane are spatially located 
(i.e. spatial information is represented explicitly in the properties of the vertices). 
Thus, vertices closer to each other on the plane are closer together. 
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(bl), top-right (tr), and type. For a quadtree vertex, these properties identify 
the two corner points defining a rectangular bounding box (i.e. the region that 
the quadtree vertex is indexing) and the vertex type which is equal to "quad" . 
For a point of interest vertex, these properties denote the region of space that 
the point of interest exists within and the vertex type which is equal to "poi." 

Quadtree vertex 1 denotes the entire region of space being indexed. This 
region is defined by its bottom-left (bl) and top-right (tr) corner points,
namely [0, 0] and [100, 100], where $bl_x = 0$, $bl_y = 0$, $tr_x = 100$, and $tr_y = 100$.
Within the region defined by vertex 1, there are 8 other defined regions that 
partition that space into smaller spaces (vertices 2-9). When one vertex sub- 
sumes another vertex by a directed edge labeled sub (i.e. subsumes), the 
outgoing (i.e. tail) vertex is subsuming the space that is defined by the in- 
coming (i.e. head) vertex. Given these properties and edges, identifying point 
of interest vertices within a region of space is simply a matter of traversing 
the quadtree index in a directed/algorithmic fashion. 
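
A minimal sketch of this representation follows, assuming a hypothetical `Vertex` class and invented coordinates (none of these names or values come from Figure 8). Each vertex carries its bl, tr, and type properties and holds direct references to the vertices it subsumes, so the index lives inside the graph itself. The helper `box_contains` checks the stated meaning of a sub edge: the tail's box covers the head's box.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    name: str
    properties: dict                                # "type", "bl", "tr"
    out_edges: list = field(default_factory=list)   # (label, head Vertex) pairs


def subsume(tail, head):
    """Add a directed edge tail --sub--> head."""
    tail.out_edges.append(("sub", head))


def box_contains(outer, inner):
    """True if the bounding box of `outer` covers the bounding box of `inner`."""
    (ox0, oy0), (ox1, oy1) = outer.properties["bl"], outer.properties["tr"]
    (ix0, iy0), (ix1, iy1) = inner.properties["bl"], inner.properties["tr"]
    return ox0 <= ix0 and oy0 <= iy0 and ix1 <= ox1 and iy1 <= oy1


# Hypothetical vertices: one root region, one sub-region, one point of interest.
root = Vertex("q1", {"type": "quad", "bl": (0, 0), "tr": (100, 100)})
child = Vertex("q2", {"type": "quad", "bl": (50, 0), "tr": (100, 50)})
point = Vertex("p1", {"type": "poi", "bl": (60, 40), "tr": (60, 40)})
subsume(root, child)
subsume(child, point)

# Every sub edge should run from a covering box (tail) to a covered box (head).
consistent = all(
    box_contains(tail, head)
    for tail in (root, child, point)
    for _, head in tail.out_edges
)
print(consistent)
```

Because each vertex references its outgoing edges directly, moving from a region to the regions it subsumes needs no external lookup structure.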

In Figure 8, the shaded region represents the spatial query: "Which points
of interest are within the rectangular region defined by the corner points
bl = [25, 20] and tr = [90, 45]?" In order to locate all the points of interest in
this region, iteratively execute the following traversal starting from the root of
the quadtree index (i.e. vertex 1). The function is defined as
$f : \hat{\mathcal{P}}(V) \rightarrow \hat{\mathcal{P}}(V)$, where

$$f(i) = \left( \epsilon_{p+}^{tr_y \geq 20} \circ \epsilon_{p+}^{tr_x \geq 25} \circ \epsilon_{p+}^{bl_y \leq 45} \circ \epsilon_{p+}^{bl_x \leq 90} \circ v_{in} \circ e_{lab+}^{sub} \circ e_{out} \right)(i).$$

The defining aspect of f is the set of 4 $\epsilon_{p+}$ filters that determine whether the
current vertex is overlapping or within the query rectangle. Those vertices
not overlapping or within the query rectangle are not traversed to. Thus, as
the traversal iterates, fewer and fewer paths are examined and the resulting
point of interest vertices within the query rectangle are converged upon. With
respect to Figure 8, after 3 iterations of f, the traversal will have returned
all the points of interest within the query rectangle. The first iteration will
traverse to the index vertices 2, 3, and 4. The second iteration will traverse
to the vertices 6, 8, and 9. Note that vertices 5 and 7 do not meet the criteria
of the $\epsilon_{p+}$ filters. Finally, on the third iteration, the traversal returns vertices
c, d, and h. Note that vertex i is not returned because it, like 5 and 7, does
not meet the $\epsilon_{p+}$ filter criteria. A summary of the legal vertices traversed to
at each iteration is enumerated below.

1. 2, 3, 4

2. 6, 8, 9

3. c, d, h
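
The iterated traversal can be sketched as a composition of small list-based functions. This is an illustration under stated assumptions: the vertex names, coordinates, and the two-level shape of the index below are invented and are not those of Figure 8, and the helper names (`e_out`, `e_lab`, `v_in`, `eps_p`) only mirror the roles of the steps in f. Lists stand in for multisets, and the query rectangle is the one from the text.

```python
# Hypothetical toy index: names and coordinates are invented for this sketch.
VERTICES = {
    "q1": {"type": "quad", "bl": (0, 0), "tr": (100, 100)},
    "q2": {"type": "quad", "bl": (0, 50), "tr": (50, 100)},
    "q3": {"type": "quad", "bl": (50, 50), "tr": (100, 100)},
    "q4": {"type": "quad", "bl": (0, 0), "tr": (50, 50)},
    "q5": {"type": "quad", "bl": (50, 0), "tr": (100, 50)},
    "p1": {"type": "poi", "bl": (10, 80), "tr": (10, 80)},
    "p2": {"type": "poi", "bl": (70, 70), "tr": (70, 70)},
    "p3": {"type": "poi", "bl": (30, 30), "tr": (30, 30)},
    "p4": {"type": "poi", "bl": (10, 10), "tr": (10, 10)},
    "p5": {"type": "poi", "bl": (60, 40), "tr": (60, 40)},
    "p6": {"type": "poi", "bl": (95, 10), "tr": (95, 10)},
}
# Each vertex keeps its own outgoing edges as (tail, label, head) triples.
OUT_EDGES = {
    "q1": [("q1", "sub", "q2"), ("q1", "sub", "q3"),
           ("q1", "sub", "q4"), ("q1", "sub", "q5")],
    "q2": [("q2", "sub", "p1")],
    "q3": [("q3", "sub", "p2")],
    "q4": [("q4", "sub", "p3"), ("q4", "sub", "p4")],
    "q5": [("q5", "sub", "p5"), ("q5", "sub", "p6")],
}


def e_out(vertices):
    """Outgoing edges of the given vertices."""
    return [e for v in vertices for e in OUT_EDGES.get(v, [])]


def e_lab(edges, label):
    """Keep only the edges carrying the given label."""
    return [e for e in edges if e[1] == label]


def v_in(edges):
    """Head (incoming) vertices of the given edges."""
    return [e[2] for e in edges]


def eps_p(vertices, predicate):
    """Keep only the vertices whose properties satisfy the predicate."""
    return [v for v in vertices if predicate(VERTICES[v])]


def f(vertices, q_bl, q_tr):
    """One traversal step: follow sub edges, then apply the four box filters."""
    step = v_in(e_lab(e_out(vertices), "sub"))
    step = eps_p(step, lambda p: p["bl"][0] <= q_tr[0])  # bl_x <= 90
    step = eps_p(step, lambda p: p["bl"][1] <= q_tr[1])  # bl_y <= 45
    step = eps_p(step, lambda p: p["tr"][0] >= q_bl[0])  # tr_x >= 25
    step = eps_p(step, lambda p: p["tr"][1] >= q_bl[1])  # tr_y >= 20
    return step


QUERY_BL, QUERY_TR = (25, 20), (90, 45)
first = f(["q1"], QUERY_BL, QUERY_TR)
second = f(first, QUERY_BL, QUERY_TR)
print(first)   # ['q4', 'q5']
print(second)  # ['p3', 'p5']
```

In this toy index two iterations suffice because the points of interest sit two edges below the root. The two upper quadrants are pruned in the first step, so their points are never examined.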

There is a more efficient traversal that can be evaluated. If the bounding
box defined by a quadtree vertex is completely subsumed by the query
rectangle (i.e. not just overlapping), then, at that branch in the traversal,
the traverser no longer needs to evaluate the $\epsilon_{p+}$ region filters and, as such,
can simply iterate all the way down sub-labeled edges to the point of interest
vertices knowing that they are completely within the query rectangle. For
example, in Figure 8, once it is realized that vertex 9 is completely within
the query rectangle, then the location properties of vertex d do not need to
be examined. The functions that define this traversal and the composition
of these functions into a flow graph are defined below, where
$A \subset \hat{\mathcal{P}}(V)$ is the multiset of all quadtree index vertices
overlapping or within the query rectangle, $B \subset A$ is the multiset of all
quadtree index vertices completely within the query rectangle, and
$C \subset \hat{\mathcal{P}}(V)$ is the multiset of all point of interest vertices
overlapping or within the query rectangle.

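$$f(i) = \left( \epsilon_{p+}^{tr_y \geq 20} \circ \epsilon_{p+}^{tr_x \geq 25} \circ \epsilon_{p+}^{bl_y \leq 45} \circ \epsilon_{p+}^{bl_x \leq 90} \circ \epsilon_{p+}^{type=quad} \circ v_{in} \circ e_{lab+}^{sub} \circ e_{out} \right)(i)$$

$$h(i) = \left( \epsilon_{p+}^{type=quad} \circ v_{in} \circ e_{lab+}^{sub} \circ e_{out} \right)(i)$$

$$s(i) = \left( \epsilon_{p+}^{tr_y \geq 20} \circ \epsilon_{p+}^{tr_x \geq 25} \circ \epsilon_{p+}^{bl_y \leq 45} \circ \epsilon_{p+}^{bl_x \leq 90} \circ \epsilon_{p+}^{type=poi} \circ v_{in} \circ e_{lab+}^{sub} \circ e_{out} \right)(i)$$

$$r(i) = \left( \epsilon_{p+}^{type=poi} \circ v_{in} \circ e_{lab+}^{sub} \circ e_{out} \right)(i)$$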



[Flow graph composing the functions f, g, h, s, and r over the multisets A, B, and C.]



Function f traverses to those quadtree vertices that overlap or are within the
query rectangle. Function g allows only those quadtree vertices that are
completely within the query rectangle. Function h traverses to subsumed quadtree
vertices. Function s traverses to point of interest vertices that are overlapping
or within the query rectangle. Finally, function r traverses to subsumed point
of interest vertices. Note that functions h and r do not check the bounding
box properties of their domain vertices. As a quadtree becomes large, this
becomes a more efficient solution to finding all points of interest within a query
rectangle.
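
One possible way to combine the five functions is sketched below. Several parts are assumptions: the toy index is invented (it is not Figure 8), the containment test used for g is only inferred from the prose description, and the worklist loop in `query` is just one plausible reading of the flow between A, B, and C. The `checked` list records which vertices had their bounding box examined by the overlap filters, so the saving is visible.

```python
VERTICES = {}  # name -> {"type", "bl", "tr"}
OUT = {}       # name -> list of (label, head) edges leaving that vertex


def add(name, vtype, bl, tr, parent=None):
    VERTICES[name] = {"type": vtype, "bl": bl, "tr": tr}
    OUT[name] = []
    if parent is not None:
        OUT[parent].append(("sub", name))


# Hypothetical index: regions are halved at each level, empty regions omitted.
add("q1", "quad", (0, 0), (100, 100))
add("q2", "quad", (0, 50), (50, 100), "q1")
add("q3", "quad", (0, 0), (50, 50), "q1")
add("q4", "quad", (50, 0), (100, 50), "q1")
add("q5", "quad", (50, 25), (75, 50), "q4")
add("q6", "quad", (75, 0), (100, 25), "q4")
add("q7", "quad", (50, 25), (62.5, 37.5), "q5")
add("q8", "quad", (62.5, 37.5), (75, 50), "q5")
for name, xy, parent in [("p1", (10, 80), "q2"), ("p2", (30, 30), "q3"),
                         ("p3", (10, 10), "q3"), ("p4", (80, 22), "q6"),
                         ("p5", (95, 10), "q6"), ("p6", (55, 30), "q7"),
                         ("p7", (60, 35), "q7"), ("p8", (70, 48), "q8")]:
    add(name, "poi", xy, xy, parent)

Q_BL, Q_TR = (25, 20), (90, 45)  # query rectangle from the text
checked = []                      # vertices whose box went through the overlap filters


def overlaps(v):
    checked.append(v)
    bl, tr = VERTICES[v]["bl"], VERTICES[v]["tr"]
    return (tr[1] >= Q_BL[1] and tr[0] >= Q_BL[0]
            and bl[1] <= Q_TR[1] and bl[0] <= Q_TR[0])


def within(v):
    # Assumed form of the "completely within" test behind g.
    bl, tr = VERTICES[v]["bl"], VERTICES[v]["tr"]
    return (bl[0] >= Q_BL[0] and bl[1] >= Q_BL[1]
            and tr[0] <= Q_TR[0] and tr[1] <= Q_TR[1])


def children(vs, vtype):
    # Follow sub edges to head vertices, keeping only the requested type.
    return [head for v in vs for (label, head) in OUT[v]
            if label == "sub" and VERTICES[head]["type"] == vtype]


def f(vs): return [v for v in children(vs, "quad") if overlaps(v)]
def g(vs): return [v for v in vs if within(v)]
def h(vs): return children(vs, "quad")
def s(vs): return [v for v in children(vs, "poi") if overlaps(v)]
def r(vs): return children(vs, "poi")


def query(root):
    a, b, c = [root], [], []      # overlapping quads, contained quads, results
    while a or b:
        c += s(a) + r(b)          # points under contained quads skip the filters
        reached = f(a)
        inside = g(reached)
        b = h(b) + inside
        a = [v for v in reached if v not in inside]
    return c


result = query("q1")
print(result)                               # ['p2', 'p4', 'p6', 'p7']
print("p6" in checked, "p7" in checked)     # False False
```

Once q7 is found to lie completely inside the query rectangle, its points p6 and p7 are returned without their location properties being read, which is the saving the text attributes to h and r.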

In general, disregarding bounding box property checks holds for both quadtree
vertices and point of interest vertices that are subsumed by a quadtree vertex
that is completely within the query rectangle.

The ability to model an index endogenous to a graph allows the domain
modeler to represent not only objects and their relations (e.g. people and their
friendships), but also "meta-objects" and their relationships (e.g. index nodes
and their subsumptions). In this way, the domain modeler can organize their
model according to partitions that make sense to how the model will be used
to solve problems. Moreover, by combining the traversal of an index with the
traversal of a domain, there exists a single unified means by which problems
are solved within a graph database: the graph traversal pattern.

4 Conclusion 

Graphs are a flexible modeling construct that can be used to model a domain
and the indices that partition that domain into an efficient, searchable space.
When the relations between the objects of the domain are seen as vertex 
partitions, then a graph is simply an index that relates vertices to vertices by 
edges. The way in which these vertices relate to each other determines which 
graph traversals are most efficient to execute and which problems can be solved 
by the graph data structure. Graph databases and the graph traversal pattern 
do not require a global analysis of data. For many problems, only local subsets 
of the graph need to be traversed to yield a solution. By structuring the graph 
in such a way as to minimize traversal steps, limit the use of external indices, 
and reduce the number of set-based operations, modelers gain great efficiency 
that is difficult to accomplish with other data management solutions.